Implementing a Bayesian Approach to Record Linkage
Abstract
The Census Coverage Measurement survey-based program estimated household population coverage of the 2010 Decennial Census. Calculating coverage estimates required linking survey person data to census enumerations. For record linkage research, we applied a Bayesian Latent Class Models approach to both 2010 coverage survey data and simulated household data. This paper presents our use of Base SAS to implement the Bayesian approach. It also discusses coding adaptations to handle changes, including removing hard-coded variable names to allow for varying input parameters.

DISCLAIMER
This report is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Any views expressed are those of the authors and not necessarily those of the U.S. Census Bureau.

INTRODUCTION
In general, record linkage methods use comparisons (agreement patterns) of common fields to define the match status of linked records from two or more files. A given record can be linked numerous times (many-to-many matches) or restricted to matching only one other record (one-to-one match). Researchers use the combined data in a variety of ways, such as producing coverage estimates or identifying duplicate records.
Larsen (2009) proposed a Bayesian record linkage application with many-to-many matches that pairs records from two files. The method builds on agreement patterns from two latent classes (matches and nonmatches) and makes the conditional independence assumption of comparison fields (variables common to both files, such as age). This approach does not allow parameters to vary by block (a block is a group of linked pairs that agree on at least one variable, such as phone number).
In this paper, we present the Bayesian approach proposed by Larsen, discuss implementing the method and show results. The implementation and results discussions include details on how the code evolved during our research.
BAYESIAN APPROACH TO RECORD LINKAGE
The Bayesian approaches presented in Larsen's 2009 paper link records from two files (A and B). A linked pair of records from files A and B is referred to as $(a,b)$. Each linked pair has k comparison fields (name, age, ...) with agreement levels (for example, agree/disagree) defined as $\gamma_k(a,b)$. The agreement pattern of the comparison fields is stored in a vector, $\gamma(a,b) = \{\gamma_1(a,b), \ldots\}$. A linked pair's match status is defined as $I(a,b) = 1$ for a match and $I(a,b) = 0$ for a nonmatch.
In this section, we present Larsen's Bayesian method (described in Section 3.1, Bayesian Approach to Latent Class Record Linkage Models) that we implemented in Base SAS. This approach models the probability of an agreement pattern, $\Pr(\gamma) = \Pr(\gamma \mid M)\,p_m + \Pr(\gamma \mid U)\,p_u$, from two latent classes (matches and nonmatches), makes the conditional independence assumption of comparison fields and uses Gibbs sampling to simulate posterior distributions. It does not allow parameters to vary by block or force one-to-one matches.
The method is as follows.
A. Select initial values of unknown parameters (initial parameters were based on previous survey matching results).
B. Repeat the following steps until convergence:
1. For each linked pair of records, draw values for the match status ("I") independently from a Bernoulli distribution with probability
$$\Pr(M \mid \gamma(a,b)) = \frac{p_m \Pr(\gamma(a,b) \mid M)}{p_m \Pr(\gamma(a,b) \mid M) + p_u \Pr(\gamma(a,b) \mid U)}$$
where
$p_m$ = probability of match given match status
$p_u$ = probability of nonmatch given match status
$\Pr(\gamma(a,b) \mid M)$ = probability of observing agreement pattern among matches
$\Pr(\gamma(a,b) \mid U)$ = probability of observing agreement pattern among nonmatches
2. Given match status, define new values for the probability of match given agreement pattern, $\Pr(M \mid \gamma(a,b))$.
• Draw probability of match, $p_m$:
$$p_m \mid I \sim \mathrm{Beta}\Big(\alpha_M + \sum_{(a,b)} I(a,b),\; \beta_M + \sum_{(a,b)} \big(1 - I(a,b)\big)\Big)$$
Set probability of nonmatch ($p_u$) to $1 - p_m$.
• Calculate probability of observing agreement pattern among matches, $\Pr(\gamma(a,b) \mid M)$. For every k comparison field, draw the probability of agreement given match:
$$\Pr(\gamma_k(a,b) = 1 \mid M, I) \sim \mathrm{Beta}\Big(\alpha_{Mk} + \sum_{(a,b)} I(a,b)\,\gamma_k(a,b),\; \beta_{Mk} + \sum_{(a,b)} I(a,b)\,\big(1 - \gamma_k(a,b)\big)\Big)$$
$$\Pr(\gamma(a,b) \mid M) = \prod_k \Pr(\gamma_k \mid M)^{\gamma_k(a,b)} \big(1 - \Pr(\gamma_k \mid M)\big)^{1 - \gamma_k(a,b)}$$
• Calculate probability of observing agreement pattern among nonmatches, $\Pr(\gamma(a,b) \mid U)$. For every k comparison field, draw the probability of agreement given nonmatch:
$$\Pr(\gamma_k(a,b) = 1 \mid U, I) \sim \mathrm{Beta}\Big(\alpha_{Uk} + \sum_{(a,b)} \big(1 - I(a,b)\big)\,\gamma_k(a,b),\; \beta_{Uk} + \sum_{(a,b)} \big(1 - I(a,b)\big)\big(1 - \gamma_k(a,b)\big)\Big)$$
$$\Pr(\gamma(a,b) \mid U) = \prod_k \Pr(\gamma_k \mid U)^{\gamma_k(a,b)} \big(1 - \Pr(\gamma_k \mid U)\big)^{1 - \gamma_k(a,b)}$$
Here $\Pr(\gamma_k \mid M)$ and $\Pr(\gamma_k \mid U)$ denote the drawn probabilities of agreement on field k among matches and nonmatches.

Implementing a Bayesian Approach to Record Linkage, continued. SESUG 2015

IMPLEMENTATION PROCESS
We programmed the algorithm with a series of Base SAS statements and the SAS macro language. DATA steps and macros gave us the control we wanted over the programming, which made it easier to test and evaluate the algorithm's detailed processing steps. For more on processing the algorithm, see the appendix.
The initial code processed simulated household data with specific characteristics. During our research, code requirements expanded to accommodate processing survey data and to shorten run times. To meet the additional requirements, we made the following code modifications:
• Referencing Variable Names of Comparison Fields
Comparison fields can vary for each application. To address the possible changes, we modified our code during research. When we initially designed the program, variable names of comparison fields were hard-coded throughout the algorithm. This made the code easy to follow but unable to process different comparison fields unless a user replaced the hard-coded references by hand.
To avoid replacing references, we replaced the hard-coded variable names with generic macro variables (&char1, &char2, ...). Once the code processed generic references, we only had to modify one line of code to change comparison fields (%let varfields= ...). Example A shows setup of the macro variables for three comparison fields: first name, last name and age. Users modify the list assigned to varfields to process different comparison fields.
Example A:
    %macro setup;
      data _null_;
        %do j=1 %to &numvars;
          /* 'G' stores char1, char2, ... in the global symbol table so they   */
          /* remain available after %setup ends; a plain CALL SYMPUT inside a  */
          /* macro can create them in the macro's local table instead          */
          call symputx("char&j","%scan(&varfields,&j)",'G');
        %end;
      run;
    %mend setup;

    /* comparison fields variable names */
    %let varfields=FNAME LNAME AGE;
    /* number of comparison fields */
    %let numvars=%sysfunc(countw(&varfields));

    /* run the setup once varfields and numvars are defined */
    %setup
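To illustrate how the generic references might then be used inside the algorithm, the following is a minimal sketch of step B.1 for a few toy linked pairs. It is not the authors' code. Everything except varfields, numvars and char1-char3 is an assumption made for illustration: the data set names (pairs, draws), the macros setup_global and pattern_prob, the m_ and u_ naming of the agreement probabilities, the parameter values, and the choice to store each comparison field as a 0/1 agreement indicator.

```sas
/* Illustrative sketch only: names other than varfields, numvars and   */
/* char1-char3 are hypothetical and do not come from the paper.        */
%let varfields=FNAME LNAME AGE;
%let numvars=%sysfunc(countw(&varfields));

/* Stand-in for Example A so the sketch runs on its own */
%macro setup_global;
  %do j=1 %to &numvars;
    %global char&j;
    %let char&j=%scan(&varfields,&j);
  %end;
%mend setup_global;
%setup_global

/* Toy linked pairs: each field holds 1 (agree) or 0 (disagree) */
data pairs;
  input FNAME LNAME AGE;
  datalines;
1 1 1
1 0 1
0 0 0
;
run;

/* Generates the product over comparison fields of the agreement      */
/* probability (field agrees) or its complement (field disagrees).    */
/* prefix = m or u selects the match or nonmatch probabilities.       */
%macro pattern_prob(prefix);
  1
  %do j=1 %to &numvars;
    * ifn(&&char&j = 1, &prefix._&&char&j, 1 - &prefix._&&char&j)
  %end;
%mend pattern_prob;

data draws;
  set pairs;
  if _n_ = 1 then call streaminit(2015);   /* arbitrary seed */
  /* assumed current parameter values for one iteration; typed out    */
  /* by name here only to keep the toy example short                  */
  pm = 0.3;
  m_FNAME = 0.9; m_LNAME = 0.9; m_AGE = 0.8;
  u_FNAME = 0.1; u_LNAME = 0.1; u_AGE = 0.2;
  pr_gamma_m = %pattern_prob(m);            /* Pr(gamma(a,b) | M) */
  pr_gamma_u = %pattern_prob(u);            /* Pr(gamma(a,b) | U) */
  pr_match = pm*pr_gamma_m / (pm*pr_gamma_m + (1 - pm)*pr_gamma_u);
  I = rand('BERNOULLI', pr_match);          /* drawn match status */
run;
```

With these assumed values, the first toy pair (agreement on all three fields) has Pr(γ|M) = 0.9 × 0.9 × 0.8 = 0.648 and Pr(γ|U) = 0.1 × 0.1 × 0.2 = 0.002, so pr_match = 0.1944 / (0.1944 + 0.0014) ≈ 0.993. Changing the comparison fields would only require editing the %let varfields= line, the input data and the parameter assignments; the pattern_prob macro itself contains no field names.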